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Abstract 


The goal of this paper is to compare several widely used Bayesian model selection methods in 
practical model selection problems, highlight their differences and give recommendations about the 
preferred approaches. We focus on the variable subset selection for regression and classification and 
perform several numerical experiments using both simulated and real world data. The results show 
that the optimization of a utility estimate such as the cross-validation (CV) score is liable to finding 
overfitted models due to relatively high variance in the utility estimates when the data is scarce. 
This can also lead to substantial selection induced bias and optimism in the performance evaluation 
for the selected model. From a predictive viewpoint, best results are obtained by accounting for 
model uncertainty by forming the full encompassing model, such as the Bayesian model averaging 
solution over the candidate models. If the encompassing model is too complex, it can be robustly 
simplified by the projection method, in which the information of the full model is projected onto 
the submodels. This approach is substantially less prone to overfitting than selection based on 
CV-score. Overall, the projection method appears to outperform also the maximum a posteriori 
model and the selection of the most probable variables. The study also demonstrates that the 
model selection can greatly benefit from using cross-validation outside the searching process both 
for guiding the model size selection and assessing the predictive performance of the finally selected 
model. 

Keywords: Bayesian model selection, cross-validation, reference model, projection, selection 

bias 


1. Introduction 

Model selection is one of the fundamental problems in statistical modeling. The often adopted view 
for model selection is not to identify the true underlying model but rather to find a model which 
is useful. Typically the usefulness of a model is measured by its ability to make predictions about 
future or yet unseen observations. Even though the prediction would not be the most important 
part concerning the modelling problem at hand, the predictive ability is still a natural measure for 
figuring out whether the model makes sense or not. If the model gives poor predictions, there is not 
much point in trying to interpret the model parameters. We refer to the model selection based on 
assessing the predictive ability of the candidate models as predictive model selection. 

Numerous methods for Bayesian predictive model selection and assessment have been proposed 
and the various approaches and their theoretical properties have been extensively reviewed by Vehtari 
and Ojanen (2012). This paper is a follow-up to their work. The review of Vehtari and Ojanen (2012) 
being qualitative, our contribution is to compare many of the different methods quantitatively in 
practical model selection problems, discuss the differences, and give recommendations about the 
preferred approaches. We believe this study will give useful insights because the comparisons to 
the existing techniques are often inadequate in the original articles presenting new methods. Our 
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experiments will focus on variable subset selection on linear models for regression and classification, 
but the discussion is general and the concepts apply also to other model selection problems. 

A fairly popular method in Bayesian literature is to select the maximum a posteriori (MAP) 
model which, in the case of a uniform prior on the model space, reduces to maximizing the marginal 
likelihood and the model selection can be performed using Bayes factors (e.g. Kass and Raftery, 
1995; Han and Carlin, 2001). In the input variable selection context, also the marginal probabilities 
of the variables have been used (e.g. Brown et ah, 1998; Barbieri and Berger, 2004; Narisetty and 
He, 2014). In fact, often in the Bayesian variable selection literature, the selection is assumed to be 
performed using one of these approaches, and the studies focus on choosing priors for the model space 
and parameters of the individual models (see the review by O’Hara and Sillanpaa, 2009). These 
studies stem from the fact that sampling the high dimensional model space is highly nontrivial, and 
because it is well known that both the Bayes factors and the marginal probabilities are sensitive to 
the prior choices (e.g. Kass and Raftery, 1995; O’Hara and Sillanpaa, 2009). 

However, we want to make a distinction between prior and model selection. More specifically, we 
want to address the question, given our prior beliefs, how should we choose the model? In our view, 
the prior choice should reflect our genuine beliefs about the unknown quantities, such as the number 
of relevant input variables. For this reason our goal is not to compare and review the vast literature 
on the priors and computation strategies, which is already done by O’Hara and Sillanpaa (2009). 
Still, we feel this literature is very essential, giving tools for formulating different prior beliefs and 
performing the necessary posterior computations. 

In other than variable selection context, a common approach is to choose the model based on its 
estimated predictive ability, preferably by using Bayesian leave-one-out cross-validation (LOO-CV) 
(Geisser and Eddy, 1979) or the widely applicable information criterion (WAIC) (Watanabe, 2009), 
both of which are known to give a nearly unbiased estimate of the predictive ability of a given 
model (Watanabe, 2010). Also several other predictive criteria with different loss functions have 
been proposed, for instance the deviance information criterion (DIG) by Spiegelhalter et al. (2002) 
and the various squared error measures by Laud and Ibrahim (1995), Gelfand and Ghosh (1998), 
and Marriott et al. (2001) none of which, however, are unbiased estimates of the true generalization 
utility of the model. 

Yet an alternative approach is to construct a full encompassing reference model which is believed 
to best represent the uncertainties regarding the modelling task and choose a simpler submodel that 
gives nearly similar predictions as the reference model. This approach was pioneered by Bindley 
(1968) who considered input variable selection for linear regression and used the model with all 
the variables as the reference model. The idea was extended by Brown et al. (1999, 2002). A 
related method is due to San Martini and Spezzaferri (1984) who used the Bayesian model averaging 
(BMA) solution as the reference model and Kullback-Leibler divergence for measuring the difference 
between the predictions of the reference model and the candidate model instead of the squared error 
loss used by Bindley. Another related method is the reference model projection by Goutis and 
Robert (1998) and Dupuis and Robert (2003) in which the information contained in the posterior of 
the reference model is projected onto the candidate models. The variations of this method include 
heuristic Lasso-type searching (Nott and Leng, 2010) and approximative projection with different 
cost functions (Tran et al., 2012; Hahn and Carvalho, 2015). 

Although LOO-CV and WAIC can be used to obtain a nearly unbiased estimate of the predictive 
ability of a given model, both of these estimates contain a stochastic error term whose variance can be 
substantial when the dataset is not very large. This variance in the estimate may lead to overfitting 
in the selection process causing nonoptimal model selection and inducing bias in the performance 
estimate for the selected model (e.g. Ambroise and McLachlan, 2002; Reunanen, 2003; Cawley and 
Talbot, 2010). The overfitting in the selection may be negligible if only a few models are being 
compared but, as we will demonstrate, may become a problem for a larger number of candidate 
models, such as in variable selection. 
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Table 1: Categorization of the different model selection methods discussed in this paper. 


View 


Section 2.2 
Section 2.3 
Section 2.4 


Generalization utility estimation (A^-open view) 
Self/posterior predictive criteria (Mixed view) 
Reference model approach (A4-completed view) 


Section 2.5 Model space approach (Ai-closed view) 


Methods 

Cross-validation, WAIC, DIC 
and L|-criteria 
Reference predictive method, 
Projection predictive method 
Maximum a posteriori model. 
Median probability model 


Our results show that the optimization of CV or WAIC may yield highly varying results and 
lead to selecting a model with predictive performance far from optimal. From the predictive point 
of view, best results are generally obtained by accounting for the model uncertainty and forming the 
full BMA solution over the candidate models, and one should not expect to do better by selection. 
Our results agree with what is known about the good performance of the BMA (Hoeting et ah, 
1999; Raftery and Zheng, 2003). If the full model is too complex or the cost for observing all the 
variables is too high, the model can be simplified most robustly by the projection method which 
is considerably less vulnerable to the overfitting in the selection. The advantage of the projection 
approach comes from taking into account the model uncertainty in forming the full encompassing 
model and reduced variance in the performance evaluation of the candidate models. Overall, the 
projection framework outperforms also the selection of the most probable inputs or the MAP model, 
while both of these typically perform better than optimization based on CV or WAIC. 

Despite its advantages, the projection approach has suffered from a difficulty in deciding how 
many variables should be selected in order to get predictions close to the reference model (Peltola 
et ah, 2014; Robert, 2014). Our study shows that the model selection can highly benefit from using 
cross-validation outside the variable searching process both for guiding the model size selection and 
assessing the predictive performance of the finally selected model. Using cross-validation for choosing 
only the model size rather than the variable combination introduces substantially less overfitting due 
to greatly reduced number of model comparisons (see Sec. 4.4 for discussion). 

The paper is organized as follows. In Section 2 we shortly go through the model selection 
methods discussed in this paper. This part of the paper somewhat overlaps with the previous 
studies (especially with Vehtari and Ojanen, 2012), but is included for maintaining easier readibility 
as a standalone paper. Section 3 discusses and illustrates the overfitting in model selection and the 
consequent selection induced bias in the performance evaluation of the chosen model. Section 4 is 
devoted to the numerical experiments and forms the core of the paper. The discussion in Section 5 
concludes the paper. 

2. Approaches for Bayesian model selection 

This section discusses shortly the proposed methods for Bayesian model selection relevant for this 
paper. We do not attempt anything close to a comprehensive review but summarize the methods 
under comparison. See the review by Vehtari and Ojanen (2012) for a more complete discussion. 

The section is organized as follows. Section 2.1 discusses how the predictive ability of a model is 
defined in terms of an expected utility and Sections 2.2-2.5 shortly review the methods. Following 
Bernardo and Smith (1994) and Vehtari and Ojanen (2012) we have categorized the methods into 
Al-closed, A4-completed, and Al-open views, see Table 1. A4-closed means assuming that one of 
the candidate models is the true data generating model. Under this assumption, one can set prior 
probabilities for each model and form the Bayesian model averaging solution (see Section 2.5). The 
A4-completed view abandons the idea of a true model, but still forms a reference model which is 
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believed to be the best available description of the future observations. In the Al-open view one 
does not assume one of the models being true and also rejects the idea of constructing the reference 
model. 

Throughout Section 2, the notation assumes a model M which predicts an output y given an 
input variable x. The same notation is used both for scalars and vectors. We denote the future 
observations by y and the model parameters by 9. To make the notation simpler, we denote the 
training data as D = {{xi,yi)}f^i. 

2.1 Predictive ability as an expected utility 

The predictive performance of a model is typically defined in terms of a utility function that describes 
the quality of the predictions. An often used utility function measuring the quality of the predictive 
distribution of the candidate model M is the logarithmic score (Good, 1952) 

u{M,y) = logp{y\D,M). (1) 

Here we have left out the future input variables x to simplify the notation. The logarithmic score is 
a widely accepted utility function due to its information theoretic grounds and will be used in this 
paper. However, we point out that in principle any other utility function could be considered, and 
the choice of a suitable utility function might also be application specific. 

Since the future observations y are unknown, the utility function u{M, y) cannot be evaluated 
beforehand. Therefore one usually works with the expected utilities instead 

m(M) =E[logp(i/|H,M)] = J ptiy)\ogp{y\D,M)dy, (2) 

where pt (y) denotes the true data generating distribution. This expression will be referred to as the 
generalization utility or more loosely as the predictive performance of model M. Maximizing (2) 
is equivalent to minimizing the Kullback-Leibler (KL) divergence from the true data generating 
distribution pt{y) to the predictive distribution of the candidate model M. 

2.2 Generalization utility estimation 
2.2.1 Cross-validation 

The generalization utility (2) can be estimated by using the already obtained sample D as proxy 
for the true data generating distribution pt{y)- However, estimating the expected utility using the 
same data D that were used to train the model leads to an optimistic estimate of the generalization 
performance. A better estimate is obtained by dividing the data into K subsets and using 

each of these sets in turn for validation while training the model using the remaining K — 1 sets. 
This gives the Bayesian A'-fold cross-validation (CV) utility (Geisser and Eddy, 1979) 

1 " 

AT-fold-CV = - ^ logp(?/i I Xi, , M), (3) 

^ • 1 
t — 1 

where denotes the validation set that contains the ith point and the training data from 

which this subset has been removed. Conditioning the predictions on fewer than n data points 
introduces bias in the utility estimate. This bias can be corrected (Burman, 1989) but small K 
increases the variance in the estimate. One would prefer to set K = n computing the leave-one-out 
utility (LOO-CV) but without any computational shortcuts this is often computationally infeasible 
as the model would need to be fitted n times. An often used compromise is K = 10. Analytical 
approximations for LOO are discussed by Vehtari et al. (2014) and computations using posterior 
samples by Vehtari et al. (2016). 
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2.2.2 Information criteria 

Information criteria offer a computationally appealing way of estimating the generalization perfor¬ 
mance of the model. A fully Bayesian criterion is the widely applicable information criterion (WAIC) 
by Watanabe (2009, 2010). WAIC can be calculated as 

1 V 

WAIC= - Vlogp(y*|x*,£>,M)-, (4) 

n n 

where the first term is the training utility and V is the functional variance given by 

n 

V = ^E[{\ogp{y^\xi,e,M)f] - E[logp(y* . (5) 

i=l 

Here both of the expectations are taken over the posterior p{0 \ D, M). Watanabe (2010) proved that 
WAIC is asymptotically equal to the Bayesian LOO-CV and to the generalization utility (2), and 
the error is o(l/n). Watanabe’s proof gives Bayesian LOO and WAIC a solid theoretical justification 
in the sense that they measure the predictive ability of the model including the uncertainty in the 
parameters and can be used also for singular models (the set of the “true parameters” consists of 
more than one point). 

Another still popular method is the deviance information criterion (DIC) proposed by Spiegel- 
halter et al. (2002). DIC estimates the generalization performance of the model with parameters 
fixed to the posterior mean 9 = E[9\D,M]. DIC can be written as 

1 ” 

DIC=-y'logp(y*|a:i,0,M)-—, (6) 

n n 

1—1 

where Peff is the effective number of parameters which can be estimated as 

n 

Peii — 2 E {logp{y^ I X*, 9, M) - E[logp(y* | Xi, 9, M)]) , (7) 

where the expectation is taken over the posterior. From Bayesian perspective, DIC is not theoreti¬ 
cally justified since it measures the fit of the model when the parameters are fixed to the posterior 
expectation and is not therefore an unbiased estimate of the true generalization utility (2). The use 
of a point estimate is questionable not only because of Bayesian principles, but also from a practical 
point of view especially when the model is singular. 

2.3 Mixed self and posterior predictive criteria 

There exists a few criteria that are not unbiased estimates of the true generalization utility (2) but 
have been proposed for model selection. These criteria do not fit to the Al-open view since the 
candidate models are partially assessed based on their own predictive properties and therefore these 
criteria resemble AI-closed/A4-completed view (for a detailed discussion, see Vehtari and Ojanen, 
2012 ). 

Ibrahim and Laud (1994) and Laud and Ibrahim (1995) proposed a selection criterion for re¬ 
gression derived by considering replicated measurements y at the training inputs. The criterion 
measures the expected squared error between the new observations and the old ones y over the 
posterior predictive distribution of the candidate model M. The error measure can be decomposed 
as 

n n 

= ^i.yi-'^[y\xi,D,M]f +^Y&Y[y\x,,D,M], ( 8 ) 

2=1 2=1 
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that is, as a sum of the squared errors for mean predictions plus sum of the predictive variances. 
The preferred model is then the one which minimizes (8). T^-criterion is not an unbiased estimate 
of (2) due to different form of the utility function, but Ibrahim and Laud (1994) showed that in 
model comparison, the criterion penalizes more complex models asymptotically with penalty halfway 
between the posterior predictive approach (i.e. fit at the training points) and the cross-validation 
approach. 

Marriott et al. (2001) proposed a closely related criterion which is a cross-validated version of (8) 


Li 


= - E 


y\x. 




^ Var 


y\x. 




M 


(9) 


This sounds intuitively better than the L^-criterion because it does not use the same data for training 
and testing. However, little is known about the properties of the estimate (9) as the authors do not 
provide a theoretical treatment. Empirically it is found that, like L^, L^.^,-criterion has a relatively 
high variance which may cause significant overfitting in model selection as discussed in Section 3 
and demonstrated experimentally in Section 4. 

Yet another related criterion based on a replicated measurement was proposed by Gelfand and 
Ghosh (1998). The authors considered an optimal point prediction which is designed to be close to 
both the observed and future data and the relative importance between the two is adjusted by a 
free parameter k. Omitting the derivation, the loss function becomes 


L 


2 _ 
k — 


k 

k + l 


- E[y I X*, D, M])2 + ^ Var[y | x„ D, M]. 
2=1 2=1 


( 10 ) 


When fc —>■ oo, this is the same as the L^-criterion (8). On the other hand, when fc = 0, the criterion 
reduces to the sum of the predictive variances. In this case there is no inherent safeguard against 
poor fit as the model with the narrowest predictive distribution is chosen. In their experiment, the 
authors reported that the results were not very sensitive to the choice of k. 


2.4 Reference model approach 

Section 2.2 reviewed methods for utility estimation that are based on sample reuse without any 
assumptions on the true model (A4-open view). An alternative way is to construct a full encom¬ 
passing reference model M*, which is believed to best describe our knowledge about the future 
observations, and perform the utility estimation almost as if it was the true data generating distri¬ 
bution (Al-completed view). We refer to this as the reference model approach. There are basically 
two somewhat different but related approaches that fit into this category, namely the reference and 
projection predictive methods, which will be discussed separately. 

2.4.1 Reference predictive method 

Assuming we have constructed a reference model M* which we believe best describes our knowledge 
about the future observations, the utilities of the candidate models M can be estimated by replacing 
the true distribution ptiy) in (2) by the predictive distribution of the reference model p{y \ D,Mi,). 
Averaging this over the training inputs {xi}2^i gives the reference utility 

1 n /. 

X! / I ^♦) M)dy . (11) 

2 = 1 

Depending on the reference model, this integral may not be analytically available and numerical 
integration may be required. However, if the output y is only one dimensional, simple quadratures 
are often adequate. 
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As the reference model is in practice different from the true data generating model, the reference 
utility is a biased estimate of the true generalization utility (2). The maximization of the reference 
utility is equivalent to minimizing the predictive KL-divergence between the reference model M* 
and the candicate model M at the training inputs 

1 " 

5{M^\\M) = -^KL{p{y\x^,D,M^)\\p{y\xi,D,M)). (12) 

n 

1—1 

The model choice can then be based on the strict minimization of the discrepancy measure (12), or 
choosing the simplest model that has an acceptable discrepancy. What is meant by “acceptable” 
may be somewhat arbitrary and depend on the context. For more discussion, see the concept of 
relative explanatory power in the next section. Equation (17). 

The reference predictive approach is inherently a less straightforward approach to model selection 
than the methods presented in Section 2.2, because it requires the construction of the reference model 
and it is not obvious how it should be done. San Martini and Spezzaferri (1984) proposed using the 
Bayesian model average (20) as the reference model (see Sec. 2.5). In the variable selection context, 
the model averaging corresponds to a spike-and-slab type prior (Mitchell and Beauchamp, 1988) 
which is often considered as the “gold standard” and has been extensively used for linear models 
(see, e.g., George and McCulloch, 1993, 1997; Raftery et ah, 1997; Brown et ah, 2002; Lee et ah, 
2003; O’Hara and Sillanpaa, 2009; Narisetty and He, 2014) and extended and applied to regression 
for over one million variables (Peltola et ah, 2012a,b). However, we emphasize that any other model 
or prior could be used as long as we believe it reflects our best knowledge of the problem and 
allows convenient computation. For instance the Horseshoe prior (Carvalho et ah, 2009, 2010) has 
been shown to have desirable properties empirically and theoretically assuming a properly chosen 
shrinkage factor (Datta and Ghosh, 2013; van der Pas et ah, 2014). 

2.4.2 Projection predictive method 

A related but somewhat different alternative to the reference predictive method (previous section) 
is the projection approach. The idea is to project the information in the posterior of the reference 
model M* onto the candidate models M so that the predictive distribution of the candidate model 
remains as close to the reference model as possible. Thus the key aspect is that the candidate model 
parameters are determined by the fit of the reference model, not by the data. Therefore also the 
prior needs to be specified only for the reference model. The idea behind the projection is quite 
generic and Vehtari and Ojanen (2012) discuss the general framework in more detail. 

A practical means for doing the projection was proposed by Goutis and Robert (1998) and 
further discussed by Dupuis and Robert (2003). Given the parameter of the reference model 0*, the 
projected parameter 9^ in the parameter space of model M is defined via 

1 ” 

9^ = a.Tgmm-'^KL{p{y\xi,9*,M^)\\p{y\x^,9,M)). (13) 

e n 

The discrepancy between the reference model M* and the candidate model M is then defined to be 
the expectation of the divergence over the posterior of the reference model 

1 ” 

5{M4M) = - ^ E [KL{p{y \ x,, 9*, M,) || p(y | 9^,M))] . (14) 

^ • 1 
1—1 

The posterior expectation in (14) is in general not available analytically. Dupuis and Robert (2003) 
proposed calculating the discrepancy by drawing samples from the posterior of the refer¬ 
ence model, calculating the projected parameters individually according to (13), and then 
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approximating (14) as 

^ n S 

6{M4M) « —'^'^KL{p{y\xi,e*,M4 || p(y | Xi, 6*^, M)) . (15) 

2=1 S—1 


Also the optimization problem in (13) cannot typically be solved analytically, and a numerical 
optimization routine may be needed. However, for the simplest models like the Gaussian linear 
model, the analytical solution is available, see Appendix A. Moreover, even when the analytical 
solution does not exist, solving the optimization problem (13) in the case of generalized linear 
models is equivalent to finding the maximum likelihood parameters for the candidate model M with 
the observations replaced by the fit of the reference model (Goutis and Robert, 1998). 

The projected samples are used for posterior inference as usual. For example, the 

predictions of the candidate model M can be computed as 

1 ^ 

piy\x,D,M) = (16) 


which is the same as the usual Monte Garlo approximation to the predictive distribution, we simply 
use the projected samples as the posterior approximation. 

For deciding which model model to choose, Dupuis and Robert (2003) introduced a measure 
called relative explanatory power 


<j){M) = 1 - 


6{M4M) 

6iM4Mo) ’ 


(17) 


where Mq denotes the empty model, that is, the model that has the largest discrepancy to the 
reference model. In terms of variable selection, Mq is the variable free model. By definition, the 
relative explanatory power obtains values between 0 and 1, and Dupuis and Robert (2003) proposed 
choosing the simplest model with enough explanatory power, for example 90%, but did not discuss 
the effect of this threshold for the predictive performance of the selected models. We note that, in 
general, the relative explanatory power is an unreliable indicator of the predictive performance of the 
submodel. This is because the reference model is typically different from the true data generating 
model Mt, and therefore both M* and M may have the same discrepancy to Mt (that is, the same 
predictive ability) although the discrepancy between M* and M would be nonzero. 

Peltola et al. (2014) proposed an alternative way of deciding the model size based on cross- 
validation outside the searching process. In other words, in a AT-fold setting the searching is repeated 
K times each time leaving 1/K of the data for testing, and the performance of the found models 
are tested on this left-out data. Note that also the reference model is trained K times and each 
time its performance is evaluated on the left-out data. Thus, one can compare the utility of both 
the found models and the reference model on the independent data and estimate, for instance, how 
many variables are needed to get statistically indistinguishable predictions compared to the reference 
model. More precisely, if Um denotes the estimated expected utility of choosing m variables and m* 
denotes the estimated utility for the reference model, the models can be compared by estimating 
the probability 


Pr [u* — Urn < Aw], (18) 

that is, the probability that the utility difference compared to the reference model is less than 
Au > 0. Peltola et al. (2014) suggested estimating the probabilities above by using Bayesian 
bootstrap (Rubin, 1981) and reported results for all model sizes for Am = 0. 

The obvious drawback in this approach are the increased computations (as the selection and 
reference model fitting is repeated K times), but in Section 4.3 we demonstrate that this approach 
may be very useful when choosing the final model size. 
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2.5 Model space approach 

Bayesian formalism has a natural way of describing the uncertainty with respect to the used model 
specification given an exhaustive list of candidate models The distribution over the model 

space is given by 


p{M\D) (xp{D\M)p{M). (19) 

The predictions are then obtained from the Bayesian model averaging (BMA) solution 

L 

piy\D) = J2piy\D,Mi)p{Mi\D). (20) 

e=i 


Strictly speaking, forming the model average means adopting the Al-closed view, that is, assuming 
one of the candidate models is the true data generating model. In practice, however, averaging over 
the discrete model space does not differ in any sense from integrating over the continuous parameters 
which is the standard procedure in Bayesian modeling. Moreover, BMA has been shown to have 
a good predictive performance both theoretically and empirically (Raftery and Zheng, 2003) and 
especially in variable selection context the integration over the different variable combinations is 
widely accepted. See the review by Hoeting et al. (1999) for a thorough discussion of Bayesian 
model averaging. 

From a model selection point of view, one may choose the model maximizing (19) ending up 
with the maximum a posteriori (MAP) model. Assuming the true data generating model belongs 
to the set of the candidate models, MAP model can be shown to be the optimal choice under the 
zero-one utility function (utility being one if the true model is found, and zero otherwise). If the 
models are given equal prior probabilities, p{M) cx 1, finding the MAP model reduces to maximizing 
the marginal likelihood, also referred to as the type-II maximum likelihood. 

Barbieri and Berger (2004) proposed a related variable selection method for the Gaussian linear 
model and named it the Median probability model (which we abbreviate simply as the Median 
model). The Median model is defined as the model containing all the variables with marginal 
posterior probability greater than i. Let binary vector 7 = ( 7 ^,..., 7 P) denote which of the variables 
are included in the model ( 7 -^ = 1 meaning that variable j is included). The marginal posterior 
inclusion probability of variable j is then 

^ p{M\D), (21) 

M : 7-7 =1 


that is, the sum of the posterior probabilities of the models which include variable j. The Median 
model 7 med is then defined componentwise as 


7med = 


1 , 

0 , otherwise. 


( 22 ) 


The authors showed that when the predictors x = (x^,... ,x^) are orthogonal, that is when Q = 
E is diagonal, the Median model is the optimal choice. By optimal the authors mean the model 
whose predictions for future y are closest to the Bayesian model averaging prediction (20) in the 
squared error sense. The authors admit that the assumption of the orthogonal predictors is a strong 
condition that does not often apply. The Median model also assumes that the optimality is defined 
in terms of the mean predictions, meaning that the uncertainty in the predictive distributions is 
ignored. Moreover, the Median model is derived assuming Gaussian noise and thus the theory does 
not apply, for instance, to classification problems. 
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Unbiased, high variance 




Candidate models M 


Figure 1: Schematic illustration of an unbiased (left) and a biased (right) utility estimation method. 

Grey lines denote the utility estimates for different datasets black is the average, and 
red the true expected utility. In this case, the biased method is likely to choose better 
models (dashed lines) due to better tradeoff in bias and variance. 


3. Overfitting and selection induced bias 

As discussed in Section 2.1, the performance of a model is usually defined in terms of the expected 
utility (2). Many of the proposed selection criteria rewieved in Sections 2.2-2.5 can be thought of 
as estimates of this quantity even if not designed directly for this purpose. 

Consider a hypothetical utility estimation method. For a fixed training dataset D, its utility 
estimate gi = g{Mi,D) for model can be decomposed as 

gi = ue + ee, (23) 

where ui = u{Mi,D) represents the true generalization utility of the model, and eg = e{Mi,D) is 
the error in the utility estimate. Note that also ug depends on the observed dataset D, because 
favourable datasets lead to better generalization performance. If the utility estimate is correct on 
average over the different datasets Yj^igg] = or equivalently E£)[ei] = 0, we say the estimate 

g is unbiased, otherwise it is biased. The unbiasedness of the utility estimate is often considered 
as beneficial for model selection. However, the unbiasedness is intrinsically unimportant for model 
selection, and a successful model selection does not necessarily require unbiased utility estimates. 
To see this, note that the only requirement for a perfect model selection criterion is that the higher 
utility estimate implies higher generalization performance, that is gg > gk implies ug > Uk for all 
models Mg and M^. This condition can be satisfied even if ^D[g(\ ^D\ug]. 

To get an idea how the bias and variance properties of a utility estimate affect the model selection, 
see Figure 1. The left plot shows an imaginary prototype of an unbiased but high variance utility 
estimation method. The grey lines represent the estimated utilities for each model M with different 
data realizations. On average (black) these curves coincide with the true expected utility over all 
datasets (red). However, due to the high variance, the maximization of the utility estimate may lead 
to choosing a model with nonoptimal expected true utility (the maxima become scattered relatively 
far away from the true optimum). We refer to this phenomenon of choosing a nonoptimal model due 
to the variance in the utility estimates as overfitting in model selection. In other words, the selection 
procedure fits to the noise in the utility estimates and therefore it is expected that the chosen model 
has a nonoptimal true utility. The left plot also demonstrates that, even though the utility estimates 
are unbiased for each model before the selection, the utility estimate for the selected model is no 
longer unbiased and is typically optimistic (the maxima of the grey lines tend to lie over the average 
curve). We refer to this as the selection induced bias. 
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The right plot shows a biased utility estimation method that either under or overestimates the 
ability of most of the models. However, due to smaller variance, the probability of choosing a model 
with better true performance is significantly increased (the maxima of the estimates focus closer to 
the true optimum). This example demonstrates that even though the unbiasedness is beneficial for 
the performance evaluation of a particular model, it is not necessarily important for model selection. 
For the selection, it is more important to be able to rank the competing models in an approximately 
correct order with a low variability. 

The overfitting in model selection and the selection induced bias are important concepts that 
have received relatively little attention compared to the vast literature on model selection in general. 
However, the topic has been discussed for example by Rencher and Pun (1980), Ambroise and 
McLachlan (2002), Reunanen (2003), Varma and Simon (2006), and Cawley and Talbot (2010). 
These authors discuss mainly the model selection using cross-validation, but the ideas apply also 
to other utility estimation methods. As discussed in Section 2.2, cross-validation gives a nearly 
unbiased estimate of the generalization performance of any given model, but the selection process 
may overfit when the variance in the utility estimates is high (as depicted in the left plot of Figure 1). 
This will be demonstrated empirically in Section 4. The variance in the utility estimate is different 
for different estimation methods but may generally be considerable for small datasets. The amount 
of overfitting in selection increases with the number of models being compared, and may become a 
problem for example in variable selection where the number of candidate models grows quickly with 
the number of variables. 

4. Numerical experiments 

This section compares the methods presented in Section 2 in practical variable selection problems. 
Section 4.1 discusses the used models and Sections 4.2 and 4.3 show illustrative examples using 
simulated and real world data, respectively. The reader is encouraged to go through the simulated 
examples first as they illustrate most of the important concepts with more detailed discussion. In 
Section 4.4 we then discuss the use of cross-validation for guiding the model size selection and for 
performance evaluation of the finally selected model. Finally, Section 4.5 provides a short note on 
the computational aspects. 

4.1 Models 

We will consider both regression and binary classification problems. To reduce the computational 
burden involved in the experiments, we consider only linear models. For regression, we apply the 
standard Gaussian model 

y I a;, w, cr^ ~ N, 
w I cr^,r^ ~ N(0,r^(T^/), 

’ ^ ’ (24) 

a ~ Inv-Gamma(ao.5 /3cr), 

~ Inv-Gamma(aT, /dr), 

where x is the p-dimensional vector of inputs, w contains the corresponding weights and is the 
noise variance. For this model, most of the computations can be obtained analytically because for 
a given hyperparameter the prior is conjugate. Since it is difficult to come up with a suitable 
value for the weight regularising variance r^, it is given a weakly informative inverse-gamma prior 
and integrated over numerically. For the binary classification, we use the probit model 

y\x,w ^ Ber(<I>(r(;^a;)), 

~ N(0,t^/), (25) 

~ Inv-Gamma(Q;rj/3 t), 
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where $(•) denotes the cumulative density of the standard normal distribution. Again, a weakly 
informative prior for is used. For this model, we use Markov chain Monte Carlo (MCMC) 
methods to obtain samples from the posterior of the weights to get the predictions. For both 
models (24) and (25) we include the intercept term by augmenting a constant term in the input 


vector X = {1, ,..., x^) and a corresponding term in the weight vector w = {w^, 


"). The 


exact values used for the hyperparameters ar, /3 t , Oicr, fia will be given together with the dataset 
descriptions in Sections 4.2 and 4.3. 

Since we are considering a variable selection problem, the submodels have different number of 
input variables and therefore different dimensionality for x and w. For notational convenience, the 
binary vector 7 = ( 7 *^, 7 ^,..., 7 ^) denoting which of the variables are included in the model is 
omitted in the above formulas. Both in (24) and (25) the model specification is the same for each 
submodel 7 , only the dimensionality of x and w change. The reference model M* is constructed as 
the BMA (20) from the submodels using the reversible jump MCMC (RJMCMC) (Green, 1995), 
which corresponds to a spike-and-slab prior for the full model. For an additional illustration using 
a hiearchical shrinkage prior, see Appendix B. For the model space we use the prior 


7 J I 7]- 


Ber( 7 r), j = l,...,p, 
TT ^ Beta(a, b). 


(26) 


Here parameters a and b adjust the prior beliefs about the number of included variables. We set 
7 ° = 1 , that is, the intercept term is included in all the submodels. Also for a and 6 , the exact 
values used will be given together with the dataset descriptions in Sections 4.2 and 4.3. 

4.2 Simulated data 

We first introduce a simulated variable selection experiment which illustrates a number of important 
concepts and the main differences between the different methods. The data is distributed as follows 


X ~ N(0, R), R € 
y\x ^ l^(^w~^x, cr^), cr^ = 1 . 


(27) 


We set the total number of variables to p = 100. The variables are divided into groups of 5 variables. 
Each variable x^ has a zero mean and unit variance and is correlated with other variables in the same 
group with coefficient p but uncorrelated with variables in the other groups (the correlation matrix 
R is block diagonal). The variables in the first three groups have weights = 

(^, 0.5^, 0.25^) while the rest of the variables have zero weight. Thus there are 15 relevant and 85 
irrelevant variables in the data. The constant ^ adjusts the signal-to-noise ratio of the data. To 
get comparable results for different levels of correlation p, we set ^ so that cr^/Varjy] = 0.3. For 
p = 0, 0.5,0.9 this is satisfied by setting approximately ^ = 0.59,0.34,0.28, respectively. 

The experiments were carried out by varying the training set size n = 100, 200,400 and the 
correlation coefficient p = 0,0.5, 0.9. We used the regression model (24) with prior parameters 
Ut = j3r = Cia = Per = 0.5. The posterior inference did not seem to be sensitive to these choices. 
As the reference model M*, we used the BMA (20) over the different input combinations with prior 
o = 1, & = 10 for the number of inputs (26). For each combination of (n,p), we performed the 
variable selection with each method listed in Table 2 for 50 different data realizations. As a search 
heuristic, we used the standard forward search, also known as the stepwise regression. In other 
words, starting from the empty model, at each step we select the variable that increases the utility 
estimate (like CV, WAIC, DIC, etc.) the most. The Median and MAP model where estimated from 
the RJMCMC samples that were drawn to form the BMA (as discussed in Section 4.1). 


12 



Bayesian model selection methods 


Table 2: Compared model selection methods for the experiments. MAP and Median models are 
estimated from the RJMCMC samples, for other methods the searching is done using 
forward searching (at each step choose the variable that improves the objective function 
value the most). The methods are discussed in Section 2. 


Abbreviation 

Method 

CV-10 

10-fold cross-validation optimization (3) 

WAIC 

WAIC optimization (4) 

Die 

Die optimization (6) 

L2 

L^-criterion optimization (8) 

L2-CV 

L^.^,-criterion optimization (9) 

h2-k 

L^-criterion optimization with k = 1 (10) 

MAP 

Maximum a posteriori model 

MPP/Median 

Sort the variables according to their marginal posterior probabilities (MPP), 
choose all with probability 0.5 or more (Median) (22) 

BMA-ref 

Posterior predictive discrepancy minimization from BMA (12), choose smallest 
model having 95% explanatory power (17) 

BMA-proj 

Projection of BMA to submodels (15), choose smallest model having 95% ex¬ 
planatory power (17) 


The found models were then tested on an independent test set of size h = 1000. As a proxy for 
the generalization utility (2), we use the mean log predictive density (MLPD) on the test set 

1 " 

MLPD(M) = ^ Vlogp(y,|i„7^,M). (28) 

n 

i=l 

To reduce variance over the different data realizations and to better compare the relative performance 
of the different methods, we report the utilities of the selected submodels M with respect to the 
gold standard BMA solution M* 

AMLPD(M) = MLPD(M)-MLPD(M*). (29) 

On this relative scale zero indicates the same predictive performance as the BMA and negative 
values worse (and positive values better, correspondingly). A motivation for looking at the relative 
performance (29) is that, as we shall see shortly, the selection typically introduces loss in the pre¬ 
dictive accuracy, and we want to assess which of the selection methods are able to find the simplest 
models with performance close to the BMA. 

Figure 2 shows the average number of selected variables and the test utilities of the selected 
models in each data setting with respect to the BMA. First, a striking observation is that none of 
the methods is able to hnd a model with better predictive performance than the BMA. From the 
predictive point of view, model averaging yields generally the best results on expectation, and one 
should not expect to do better by selection. This result is in perfect accordance with what is known 
about the good performance of the BMA (Hoeting et ah, 1999; Raftery and Zheng, 2003). Thus, 
the primary motivation for selection should be the simplification of the model without substantially 
compromising the predictive accuracy, rather than trying to improve over the predictions obtained 
by taking into account the model uncertainty. 

Second, for the smallest dataset size many of the methods perform poorly and choose models 
with bad predictive performance, comparable or even worse than the model with no variables at all 
(the dotted lines). This holds for CV-10, WAIC, DIG, L2, L2-CV, and L2-fc, and the conclusion 


13 





PlIRONEN, VeHTARI 


Variables selected 


AMLPD(Mfe) 


n = 100 



CV-10 

WAIC 

Die 

L2 

L2-CV 

L2-fc 

MAP 

Median 

BMA-ref 

BMA-proj 


n = 200 


CV-10 

WAIC 

DIG 

L2 

L2-CV 

L2-fc 

MAP 

Median 

BMA-ref 

BMA-proj 


0 25 50 75 100 - 0.6 - 0.4 - 0.2 0 



n = 400 




100 - 0.6 


- 0.4 


•• 
«• 
• • 


- 0.2 


CV-10 

WAIC 

Die 

L2 

L2-CV 

L2-fc 

MAP 

Median 

BMA-ref 

BMA-proj 


Figure 2: Simulated data: Average forward search paths for some of the selection methods for 
different training set sizes n when p = 0.5. Red shows the CV utility (10-fold) and black 
the test utility with respect to the BMA (29) after sorting the variables, as a function of 
number of variables selected averaged over the 50 different data realizations. The difference 
between these two curves illustrates the selection induced bias. The dotted vertical lines 
denote the average number of variables chosen with each of the methods (see Table 2). 


covers all the levels of correlation between the variables (blue, red and green circles), albeit the 
high dependency between the variables somewhat improves the results. The observed behaviour is 
due to overfitting in the selection process (as we will show below). Due to scarce data, the high 
variance in the utility estimates leads to selecting overfitted models as discussed in Section 3. These 
methods perform reasonably only for the largest dataset size n = 400. MAP, Median, BMA-ref, 
and BMA-proj perform significantly better, choosing smaller models with predictive ability closer 
to the BMA. A closer inspection reveals that out of these four, BMA-proj performs best in terms 
of the predictive ability especially for the smallest dataset size n = 100, but the better predictive 
accuracy is partially due to selecting more variables than the other three methods. Note also that 
for BMA-proj the predictions are computed using the projected samples (Eq. (16)), whereas for all 
the other methods the parameter estimation is done by fitting the submodels to data. Later in this 
section we will show that the parameter estimation using the projection can play a considerable role 
in achieving improved predictive accuracy for the submodels, and the good performance of BMA- 
proj is not simply due to superior ordering of the variables, in fact MPP may perform even better 
in this aspect (see discussion related to Figures 5 and 6). 

To get more insight to the problem, let us examine more closely how the predictive performance 
of the submodels change when variables are selected. Figure 3 shows the CV and test utilities after 
sorting the variables, as a function of the number of variables selected along the forward search path 
when p = 0.5. The CV-utility (10-fold) is computed within the data used for selection (n points), 
and the test utility on independent data (note that computing the CV-curve for BMA-proj requires 
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Figure 3: Simulated data: Average forward search paths for some of the selection methods for 
different training set sizes n when p = 0.5. Red shows the CV utility (10-fold) and black 
the test utility for the submodels with respect to the BMA (29) as a function of number of 
variables selected averaged over the 50 different data realizations. The difference between 
these two curves illustrates the selection induced bias. The dotted vertical lines denote 
the average number of variables chosen with each of the methods (see Table 2). 


cross-validating the BMA and performing the projection for the submodels separately for each fold). 
The search paths for CV-10 (top row) demonstrate the overfitting in model selection; starting from 
the empty model and adding variables one at a time one finds models that have high CV utility 
but much worse test utility. In other words, the performance of the models at the search path is 
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Figure 4: Simulated data: Variability in the predictive performance of the found submodels with 
respect to the BMA (29) along the forward search path as a function of number of variables 
selected for the same methods as in Figure 3 for different training set sizes n when p = 0.5. 
The grey lines show the test utilities for the different data realizations and the black line 
denotes the average (the black lines are the same as in Figure 3). The dotted vertical lines 
denote the average number of variables chosen. 


dramatically overestimated and the gap between the two curves denotes the selection induced bias. 
Yet in other words, after selection (sorting the variables) the CV utility is an optimistic estimate for 
the selected models. Note, however, that for the empty model and the model with all the variables 
the CV utility and the test utility are on average almost the same because these models do not 
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involve any selection. The overfitting in the selection process decreases when the size of the training 
set grows because the variance of the error term in decomposition (23) becomes smaller, but the 
effect is still visible for n = 400. The behaviour is very similar also for WAIC, DIG, L2, L2-CV and 
L2-fc (the results for the last three are left out to save space). 

Ordering the variables according to their marginal posterior probabilities (MPP) and selecting 
the Median model works much better than CV-10, leading to selection of smaller models with 
good predictive performance. However, even better results are obtained by using the reference 
model approach, especially the projection (BMA-proj). The results clearly show that the projection 
approach is much less vulnerable to the overfitting than CV, WAIC and DIG, even though the CV 
utility is still a biased estimate of the true predictive ability for the chosen models. Even for the 
smallest dataset size, the projection is able to find models with predictive ability very close to the 
BMA with about 10-15 variables on average. Moreover, the projection has the inherent advantage 
over the other methods performing reasonably well (like MPP/Median) that when more variables 
are added, the submodels get ever closer to the reference model (BMA), thus avoiding the dip in the 
predictive accuracy apparent with the other methods around 10 variables. This is simply because 
the submodels are constructed to be similar than the model averaging solution which yields the best 
results (this point will be further discussed below, see discussion related to Figure 6). 

Figure 4 shows the variability in the performance of the selected models for the same selection 
methods as in Figure 3. The grey lines denote the test utilities for the selected models as a function 
of number of selected variables for different data realizations and the black line denotes the average 
(same as in Figure 3). For small training set sizes the variability in the predictive performance 
of the selected submodels is very high for GV-10, WAIG, DIG, and MPP. The reference model 
approach, especially the projection, reduces the variability substantially finding sparse submodels 
with predictive performance close to the BMA in all the data realizations. This is another property 
that makes the projection approach appealing. Figure 4 further emphasizes how difficult it is to 
improve over the BMA in predictive terms; most of the time the model averaging yields better 
predictions than any of the found submodels. Moreover, even when better submodels are found 
(the cases where the grey lines exceed the zero level), the difference in the predictive performance is 
relatively small. 

Although our main focus is on the predictive ability of the chosen models, we also studied how 
the different methods are able to choose the truly relevant variables over the irrelevant ones. Here 
by “relevant” we mean those 15 variables that were used to generate the output y even though 
it might not be completely clear how the relevance should be defined when there are correlations 
between the variables (for example, should we select one or both of two correlating variables which 
both correlate with the output). Figure 5 shows the proportion of relevant variables chosen (vertical 
axis) versus proportion of irrelevant variables chosen (horizontal axis) on average (the larger the 
area under the curve, the better). In this aspect, ordering the variables according to their marginal 
probabilities seems to work best, slightly better than the projection. The other methods seem 
to perform somewhat worse. Interestingly, although the projection does not necessarily order the 
variables any better than the marginal posterior probability order, the predictive ability of the 
projected submodels is on average better and varies less as Figure 4 demonstrates. 

To explain this behaviour, we did one more experiment and studied the difference of learning 
the submodel parameters by the projection from the BMA compared to fitting the submodels to the 
data. We performed this analysis both when the selection was done by the marginal probabilities 
or by forward search minimizing the discrepancy to the BMA, see Figure 6. The results show 
that constructing the submodels by projection improves the results regardless of which of the two 
methods is used to sort the variables, and in this example, there seems to be little difference in the 
final results as long as projection is used to construct the submodels. 

This effect can be explained by transimission of information from the reference model. Recall 
that the BMA corresponds to setting the sparsifying spike-and-slab prior for the full model. Because 
the prior information is transmitted also to the submodels in the projection, it is natural that 
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Figure 5: Simulated data: Proportion of relevant (vertical-axis) versus proportion of irrelevant vari¬ 
ables chosen (horizontal axis) for the different training set sizes n. The data had 100 
variables in total with 15 relevant and 85 irrelevant variables, relevant being defined as 
a variable that was used to generate the output y. The colours denote the correlation 
level between the variables (see the legend). The curves are averaged over the 50 data 
realizations. 


the submodels benefit from this (compared to the Gaussian prior in (24)) especially when some 
irrelevant variables have been included. Furthermore, the noise level is also learned from the full 
model (see Eq. (31)), which reduces overfitting of the submodels as the full model best represents 
the uncertainties related to the problem and best captures the correct the noise level. More detailed 
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Figure 6: Simulated data: The average test utility with respect to the BMA (29) as a function of 
number of variables selected when the submodel parameters are learned by projection from 
the BMA (solid line) and by standard fitting to the data (dotted line). The projection 
improves the performance of the submodels regardless of whether the variables are sorted 
by their marginal posterior probabilities (top row) or by a forward search minimizing the 
discrepancy to the BMA (bottom row). 


analysis of these effects would be useful, but we do not focus on it more in this paper and leave it 
to the future research. 

To summarize, it clearly appears that the full model averaging solution produces best predictive 
results, and the projection appears the most robust method for simplifying the full model without 
losing much predictive accuracy. However, the number of variables actually selected depends on 
the arbitrary 95% explanatory power rule, and although it seems work quite well for the examples 
above, it does not always lead to optimal results (see the real world examples in the next section). 
We discuss this problem and a possible solution further in Section 4.4. 

4.3 Real world datasets 

We also studied the performance of the different methods on several real world datasets. Five publicly 
available^ datasets were used and they are summarized in Table 3. One of the datasets deals with 
regression and the rest with binary classification. As a preprocessing, we normalized all the input 
variables to have zero mean and unit variance. For the Crime dataset we also log-normalized the 
original non-negative target variable (crimes per population) to get a real-valued and more Gaussian 
output. From this dataset we also removed some input variables and observations with missing values 
(the given p and n in Table 3 are after removing the missing values). We will not discuss the datasets 
in detail but refer to the sources for more information. 

For the regression problem we applied the Gaussian regression model (24) and for the classifica¬ 
tion problems the probit model (25). The prior parameters in each case are listed in Table 3. For 
all the problems we used relatively uninformative priors for the input weights (and measurement 
noise). As the reference model, we again used the BMA solution estimated using reversible jump 

1. The first three datasets are available at the UCI Machine Learning repository https://archive.ics.uci.edu/ 
ml/index.html. 

Ovarian cancer dataset can be found at http://www.dcs.gla.ac .uk/~srogers/lpd/lpd.html and the Colon cancer 
data at http://genomics-pubs.princeton.edu/oncology/affydata/index.html. 
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Table 3: Summary of the real world datasets and used priors, p denotes the total number of input 
variables and n is the number of instances in the dataset (after removing the instances with 
missing values). The classification problems deal all with a binary output variable. 


Dataset 

Type 

P 

n 

Prior parameters 

Crime 

Regression 

102 

1992 

O^T (^T 

= 0.5, a^r = = 0.5, 0 = 6 = 2 

Ionosphere 

Classification 

33 

351 

CYf /?-7- 

= 0.5, 0 = 6 = 2 

Sonar 

Classification 

60 

208 

CXt - i^T 

= 0.5, 0 = 6 = 2 

Ovarian cancer 

Classification 

1536 

54 

Otr — /?r 

= 2, o = 1, 6= 1200 

Colon cancer 

Classification 

2000 

62 

CXt — i^T 

= 2, o = 1, 6 = 2000 


Crime Ionosphere Sonar Ovarian 


Colon 



Figure 7: Real datasets: Prior and posterior probabilities for the different number of variables (top 
row) and marginal posterior probabilities for the different variables sorted from the most 
probable to the least probable (bottom row). The posterior probabilities are given with 
95% credible intervals estimated from the variability between different RJMCMC chains. 
The results are calculated using the full datasets (not leaving any data out for testing). 
For Ovarian and Colon datasets the plots are truncated at 30 variables. 


MCMC. For the first three problems (Crime, Ionosphere, Sonar) we used a very uninformative prior 
for the number of input variables (i.e., a = b = 2) because there was basically no prior information 
about the sparsity level. For the last two datasets (Ovarian and Colon) for which p ^ n we had 
to use priors that favor models with only a few variables to avoid overfitting. Figure 7 shows the 
estimated posterior probabilities for different number of variables (top row) and marginal posterior 
probabilities for the different inputs (bottom row) for all the datasets. Although these kind of plots 
may give some idea about the variable relevancies, it is still often difficult to decide which variables 
should be included in the model and what would be the effect on the predictive performance. 

We then performed the variable selection using the methods in Table 2 except the ones based on 
the squared error (L2, L2-CV, L2-fc) were not used for the classification problems. For Ovarian and 
Colon datasets, due to large number of variables, we also replaced the 10-fold-CV by the importance 
sampling LOO-CV (IS-LOO-CV) to reduce the computation time. For these two datasets we also 
performed the forward searching only up to 10 variables. To estimate the predictive ability of the 
chosen models, we repeated the selection several times each time leaving part of the data out and 
then measuring the out-of-sample performance using these observations. The Crime dataset was 
sufficiently large (n = 1992) to be splitted into training and test sets. We repeated the selection for 
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50 random splits each time using n = 100, 200,400 points for training and the rest for testing. This 
also allowed us to study the effect of the training set size. For Ionosphere and Sonar we used 10-fold 
cross-validation, that is, the selection was performed 10 times each time using 9/10 of the data 
and estimating the out-of-sample performance with the remaining 1/10 of the data. For Ovarian 
and Colon datasets, due to few observations, we used leave-one-out cross-validation for performance 
evaluation (the selection was performed n times each time leaving one point out for testing). Again, 
we report the results as the mean log predictive density on the independent data with respect to the 
BMA (29). 

Figure 8 summarizes the results. The left column shows the average number of variables selected 
and the right column the estimated out-of-sample utilities for the chosen models (out-of-sample 
utilities are estimated using hold-out samples not used for selection as explained above). The results 
are qualitatively very similar to those obtained for the simulated experiments (Sec. 4.2). Again 
we conclude that the BMA solution gives better predictions than any of the selection methods 
when measured on independent data. Moreover, the results demonstrate again that model selection 
using CV, WAIC, DIG, L2, L2-CV, or L2-fc is liable to overfitting especially when the dataset is 
small compared to the number of variables. Overall, MAP and Median models tend to perform 
better but show non-desirable performance on some of the datasets. Especially the Median model 
performs badly for Ovarian and Colon datasets where almost all the variables have marginal posterior 
probability less than 0.5 (depending on the split into training and validation sets). The projection 
(BMA-proj) shows the most robust performance choosing models with predictive ability close to the 
BMA for all the datasets. 

Figure 9 shows the CV (red) and out-of-sample (black) utilities for the chosen models as a 
function of number of chosen variables for the classification problems. CV-utilities (10-fold) are 
computed after sorting the variables within the same data used for selection, and the out-of-sample 
utilities are estimated on hold-out samples not used for selection as explained earlier. This figure is 
analogous to Figure 3 showing the magnitude of the selection induced bias (the difference between 
the red and black lines). Especially for the last three datasets (Sonar, Ovarian, Colon) the selection 
induced bias is considerable for all the methods, which emphasizes the importance of validating the 
variable searching process in order to avoid bias in performance evaluation for the found models. 
Overall, the projection (BMA-proj) appears to find models with best out-of-sample accuracy for a 
given model complexity, albeit for Ovarian dataset choosing about five most probable inputs (MPP) 
would perform even better. Moreover, the uncertainty in the out-of-sample performance for a given 
number of variables is also the smallest for the projection over all the datasets. 

The same applies to the Crime dataset, see Figure 10. For any given number of variables the 
projection is able to find models with predictive ability closest to the BMA and also with the least 
variability, the difference to the other methods being the largest when the dataset size is small. For 
Crime data, some additional results using hiearchical shrinkage prior for the full model are presented 
in Appendix B.2. 
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Figure 8: Real datasets: The number of selected variables (left column) and the estimated out-of- 
sample utilities of the selected models (right column) on average and with 95% credible 
intervals for the different datasets. The out-of-sample utilities are estimated using inde¬ 
pendent data not used for selection (see text) and are shown with respect to the BMA (29). 
The dotted line denotes the performance of the empty model (the intercept term only). 
For Ovarian and Colon datasets the searching was performed only up to 10 variables 
although both of these datasets contain many more variables. 
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Figure 9: Classification datasets: CV (red) and out-of-sample (black) utilities on average for the 
selected submodels with respect to the BMA (29) along the forward search path as a 
function of number of variables selected. CV utilities (10-fold) are computed within the 
same data used for selection and the out-of-sample utilities are estimated on hold-out 
samples not used for selection (see text) and are given with 95% credible intervals. The 
dotted vertical lines denote the average number of variables chosen. CV optimization 
(top row) is carried out using 10-fold-CV for Ionosphere and Sonar, and IS-LOO-CV for 
Ovarian and Colon. 
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Figure 10: Crime dataset: Variability in the test utility of the selected submodels with respect to the 
BMA (29) along the forward search path as a function of number of variables selected. 
The selection is performed using n = 100, 200,400 points and the test utility is computed 
using the remaining data. The grey lines show the test utilities for the 50 different splits 
into training and test sets and the black line denotes the average. The dotted vertical 
lines denote the average number of variables chosen. 
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4.4 On choosing the final model size 

Although the search paths for the projection method (BMA-proj) seem overall better than for the 
other methods (Figures 3, 4, 9 and 10), the results also demonstrate difficulty in deciding the final 
model size; for instance, for Ionosphere and Sonar datasets the somewhat arbitrary 95% explanatory 
power rule chooses rather too few variables, but for Ovarian and Colon unnecessarily many variables 
(the out-of-sample utility close to the BMA can be obtained with fewer variables). The same applies 
for the Crime dataset with the smallest number of training points in = 100). As discussed in 
Section 2.4, a natural idea would be to decide the final model size based on the estimated out-of- 
sample utility (the black lines in Figures 9 and 10) which can be done by cross-validation outside 
the searching process. This opens up the question, does this induce a significant amount of bias in 
the utility estimate for the finally selected model? 

To assess this question, we performed one more experiment on the real world datasets by adding 
another layer of cross-validation to assess the performance of the finally selected models on indepen¬ 
dent data. In other words, the variable searching was performed using the projection, the inner layer 
of cross-validation (10-fold) was used to decide the model size and the outer layer to measure the 
performance of the finally selected models (10-fold for Ionosphere and Sonar, LOO-CV for Ovarian 
and Colon, and hold-out with different training set sizes for Crime). As the rule for deciding the 
model size, we selected the smallest number of variables satisfying 

Pr [AMLPD(m) >U\>a 

for different thresholds U and a. Here AMLPD(m) denotes the estimated out-of-sample utility for 
m variables in the inner validation. This probability is the same as (18), the inequality is merely 
organized differently. Here U denotes how much one is willing to sacrifice the predictive accuracy in 
order to reduce the number of variables, and a denotes how certain we want to be about not going 
below U. We estimate the above probability using the Bayesian bootstrap. 

Figure 11 shows the final expected utility on independent data for different values of U and a. for 
the different datasets. The results show that for a large enough credible level a = 0.95, the applied 
selection rule appears to be safe in the sense that the final expected utility remains always above the 
level of U (the dots stay above the diagonal line). This means that there is no substantial amount 
of selection induced bias at this stage, and the second level of validation is not necessarily needed. 
However, this does not always hold for smaller values of a (a = 0.5 and a = 0.05). 

We can make these results more intuitive by looking at an example. Consider the projection 
search path for the Sonar dataset in Figure 9 (bottom row, second column). Assume now that 
U = —0.01, meaning that we are willing to lose 0.01 in the MLPD compared to the BMA, which is 
about 5% of the difference between the BMA and the empty model. Since the grey bars denote the 
central 95% credible intervals, setting a = 0.975 would correspond to choosing the smallest model 
for which the lower limit of the grey bar falls above U = —0.01 (because then the performance of 
the submodel is greater than this with probability 0.975). This would lead to choosing about 35 
variables, and we could be quite confident that the final expected utility would be within 0.01 from 
the BMA. On the other hand, setting a = 0.025 corresponds to choosing the smallest model for 
which the upper limit of the grey bar falls above U = —0.01 (because then the performance of the 
submodel is greater than this with probability 0.025). This would lead to choosing only 3 variables, 
but in this case we cannot expect confidently that the final expected utility would be within 0.01 
from the BMA, and in fact it would be lower than this with high probability. 

Following the example above, one can get an idea of the effect of a and U to the final model 
size and the final expected utility. Generally speaking, U determines how much we are willing to 
compromise the predictive accuracy and a determines how condident we want to be about the final 
utility estimate. It is application specific whether it is more important to have high predictive 
accuracy or to reduce the number of variables at the cost of losing predictive accuracy, but a 
reasonable recommendation might be to use a > 0.95 and U to be 5% of the difference between the 
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Crime (n = 100) 


Crime (n = 200) 


Crime (n = 400) 



Threshold U 


Figure 11: Real datasets: Vertical axis shows the final expected utility on independent data with 
respect to the BMA (29) for the selected submodels when the searching is done us¬ 
ing the projection (BMA-proj) selecting the smallest number of variables m satisfying 
Pr [AMLPD(m) > C] > a, where AMLPD(to) denotes the estimated out-of-sample util¬ 
ity for m variables estimated using the CV (10-fold) outside the searching process (same 
as the black lines in Figure 9). The final utility is estimated using another layer of 
validation (see text). The dotted line denotes the utility for the empty model. When 
a = 0.95, the final utility remains equal or larger than U (the dots stay above the diagonal 
line) indicating that the applied selection rule does not induce bias in the performance 
evaluation for the finally selected model. 


reference model and the empty model. Based on Figure 11, this combination would appear to give 
predictive accuracy close to the BMA, and based on Figure 9 yield quite effective reduction in the 
number of variables (choosing about 20 variables for Ionosphere, 35 for Sonar, and 5-10 variables 
for Ovarian and Colon by visual inspection). 

As a final remark, one might wonder why the cross-validation works well if used only to decide 
the model size but poorly if used directly to optimize the variable combination (as depicted e.g. in 
Figure 3)? As discussed in Section 3, the amount of overfitting in the selection and the consequent 
selection bias depends on the variance of the utility estimate for a given model (over the data real¬ 
izations) and the number of candidate models being compared. For the latter reason, the overfitting 
is considerably smaller when cross-validation is used only to decide the model size by comparing 
only p + \ models (given the variable ordering), in contrast to selecting variable combination in the 
forward search phase among O(p^) models. Moreover, the utilities of two consecutive models at 
the search path are likely to be highly correlated, which further reduces the freedom of selection 
and therefore the overfitting at this point. It is for this same reason why cross-validation may yield 
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reasonable results when used to choose a single hyperparameter among a small set of values, in 
which case essentially only a few different models are being compared. 

Based on the results presented in this section, despite the increased computational effort, we 
believe the use of cross-validation on top of the variable searching is highly advisable both for 
choosing the final model size and giving a nearly unbiased estimate of the out-of-sample performance 
for the selected model. We emphasize the importance of this regardless of the method used for 
searching the variables, but generally we recommend using the projection given its overall superior 
performance in our experiments. 

4.5 Computational considerations 

We conclude with a remark on the computational aspects. The results in Sections 4.2 and 4.3 
emphasized that from the predictive point of view, the model averaging over the different submodels 
yields often better results than selection of any single model. Forming the model averaging solution 
over the variable combinations may be computationally challenging, but is quite feasible up to 
problems with a few thousand variables or less with, for instance, a straighforward implementation of 
the RJMCMC algorithm with simple proposal distributions. Scalability up to a million variables can 
be obtained with more sophisticated and efficient proposals (Peltola et ah, 2012b). The computations 
could also be sped up by using a hierarchical shrinkage prior such as the horseshoe (see Appendix B.l) 
for the regression weights in the full model, instead of the spike-and-slab (which is equivalent to the 
model averaging over the variable combinations). 

After forming the full reference model (either the BMA or using some alternative prior), the 
subsequent computations needed for the actual selection are typically less burdensome. Computing 
the MAP and the Median models from the MCMC samples from the model space is easy, but 
these cannot be computed with alternative priors. Constructing the submodels by projection and 
performing a forward search through the variable combinations is somewhat more laborous, but 
takes still usually considerably less time than forming the reference model. The projection approach 
can also be used with any prior for the full model, as the posterior samples are all that is needed 
(see Appendix B). 

The methods that do not rely on the construction of the full reference model (CV, WAIC, DIC, 
and the L^-variants) are typically computationally easier as they avoid the burden of forming the 
reference model in the first place. However, if one has to go through a lot of models in the search, 
the procedure is fast only if the fitting of the submodels is fast. This is the case for instance for 
the Gaussian linear model for which the posterior computations are obtained analytically, but in 
other problems like in classification, sampling the parameters separately for each submodel may be 
very expensive, and one may be forced to use faster approximations (such as the Laplace method or 
expectation propagation). On the other hand, it must be kept in mind that the possible computa¬ 
tional savings from avoiding the construction of the full model may come at the risk of overfitting 
and reduced predictive accuracy, as our results show. 

5. Conclusions 

In this paper we have shortly reviewed many of the proposed methods for Bayesian predictive 
model selection and illustrated their use and performance in practical variable selection problems 
for regression and binary classification, where the goal is to select a minimal subset of input variables 
with a good predictive accuracy. The experiments have been carried out using both simulated and 
several real world datasets. 

The numerical experiments show that the overfitting in the selection may be a potential problem 
and hinder the model selection considerably. This is the case especially when the dataset is small 
(high variance in the utility estimates) and the number of models under comparison large (large 
number of variables). Especially vulnerable methods for this type of overfitting are CV, WAIC, DIC 
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and other methods that rely on data reuse and have therefore relatively high variance in the utility 
estimates. From the predictive point of view, better results are generally obtained by accounting for 
the model uncertainty and forming the full encompassing (reference) model with all the variables 
and best possible prior information on the sparsity level. Our results showed that Bayesian model 
averaging (BMA) over the candidate models yields often the best results on expectation, and one 
should not expect to do better by selection. This agrees with what is known about the good 
performance of the BMA (Hoeting et ah, 1999; Raftery and Zheng, 2003). 

If the full model is too complex or the cost for observing all the variables is too high, the model 
can be simplihed most robustly by the projection method which is considerably less vulnerable to 
the overfitting in the selection. The advantage of the projection approach comes from taking into 
account the uncertainty in forming the full encompassing model and then finding a simpler model 
which gives similar answers as the full model. Overall, the projection framework outperforms also 
the selection of the most probable variables or variable combination (Median and MAP models) 
being able to best retain the predictive ability of the full model while effectively reducing the model 
complexity. The results also demonstrated that the projection does not only outperform the other 
methods on average but the variability over the different data realizations is also considerably smaller 
compared to the other methods. In addition, the numerical experiments showed that constructing 
the submodels by the projection from the full model may improve the predictive accuracy even when 
some other strategy, such as marginal probabilities, are used to rank the variables. 

Despite its advantages, the projection method has the inherent challenge of forming the reference 
model in the first place. There is no automated way of coming up with a good reference model which 
emphasizes the model critisism. However, as already stressed, incorporating the best possible prior 
information into the full encompassing model is formally the correct Bayesian way of dealing with 
the model uncertainty and often seems to also provide the best predictions in practice. In this study 
we used the model averaging over the variable combinations as the reference model, but similar 
results are obtained also with the hierarchical shrinkage prior (see Appendix B). 

Another issue is that, even though the projection method seems the most robust way of searching 
for good submodels, the estimated discrepancy between the reference model and a submodel is in 
general an unreliable indicator of the predictive performance of the submodel. In variable selection, 
this property makes it problematic to decide how many variables should be selected to obtain 
predictive performance close to the reference model, even though the minimization of the discrepancy 
from the reference model typically finds a good search path through the model space. 

However, the results show that this problem can be solved by using cross-validation outside the 
searching process, as this allows studying the tradeoff between the number of included variables 
and the predictive performance, which we believe is highly informative. Moreover, we demonstrated 
that selecting the number of variables this way does not produce considerable overfitting or selection 
induced bias in the utility estimate for the finally selected model, because the selection is conditioned 
on a greatly reduced number of models (see Sec. 4.4). While this still leaves the user the responsibility 
of deciding the final model size, we emphasize that this decision depends on the application and the 
costs of the inputs. Without any costs for the variables, we would simply recommend using them 
all and carrying out the full Bayesian inference on the model. 
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A. Projection for the linear Ganssian model 

Consider the single output linear Gaussian regression model with several input variables (24). For 
this model, the projected parameters (13) can be calculated analytically. Assume now that the 
reference model M* is the full model with all the inputs with any prior on the weights w and noise 
variance Given a sample {w, cr^) from the posterior of the full model, the projected parameters 
are given by (see the derivation in Piironen and Vehtari, 2015) 


WR = {X±^X±)-'^X±^Xw, 

(30) 

cr]_ = + —(Xia — X±ia±)~''(Xw — X±w±), 

(31) 


where X = (a;^, ■ ■ ■ ,xlf) denotes the n x p predictor matrix of the full model, and X± the contains 
those columns of X that correspond to the submodel M we are projecting onto. The associated 
KL-divergence (for this particular sample) is given by 

d{w,a^) = \\og^. (32) 

2 

The projection equations (30) and (31) have a nice interpretation. The projected weights (30) 
are determined by the maximum likelihood solution with the observations y replaced by the fit of 
the full (reference) model / = Xw. The projected noise variance (31) is the noise level of the full 
model plus the mismatch between the reference and the projected model. 

As discussed in Section 2.4.2, we draw a sample {u’s 7 cr^}f=i from the posterior of the reference 
(full) model, compute the projected parameters _L}f=i and associated KL-divergences ac¬ 

cording to Equations (30), (31) and (32), and then estimate the discrepancy between the full and 
submodel as 

1 S 

6iM4M) = -Y,d{ws,a^,). (33) 

For a given number of variables, we then seek for a variable combination that gives minimal discrep¬ 
ancy. This procedure will produce a parsimonious model with exactly zero weights for the variables 
that are left out and predictive distribution similar to the full model. The predictive distribution of 
the submodel (16) is given by 


p{y\x,D,M) = \ Ws,±^x, _l). (34) 

B. Projection with hierarchical shrinkage prior 

Here we will briefly illustrate that the projection approach (Sec. 2.4.2) can also be used with other 
priors than the spike-and-slab (corresponding to the Bayesian model averaging, BMA) that we used 
in the main experiments in Section 4. 
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B.l Hierarchical shrinkage prior 


Consider again the linear Gaussian regression model with several inputs (24). A hierarchical shrink¬ 
age (HS) prior for the regression weights w = {wi ,..., Wp) can be obtained as 


Wj 


\K,T‘ 


N(0,A2t2 


(35) 


where t^(-) denotes the half-Student-t prior with v degrees of freedom. Intuitively, we expect the 
local variance parameters Af to be large for those inputs that have large weight, and small for those 
with negligible weight, while the global variance term adjusts the overall sparsity level. The 
shrinkage property of the prior (35) comes from the fact that the half-t prior evaluates to a positive 
constant at the origin and places therefore probability mass for small values of A^. Moreover, 
if the tails of the half-t densities are heavy enough, they allow some of the weights to remain 
unshrunk. The horseshoe prior (Carvalho et ah, 2009, 2010) is obtained by setting ly = 1, that is, 
by introducing half-Cauchy priors for the local scale parameters A^. The horseshoe prior has been 
shown to possess desirable theoretical properties and performance close to the gold standard BMA in 
practice (Carvalho et ah, 2009, 2010; Datta and Ghosh, 2013; van der Pas et ah, 2014). However, it 
is observed that when using Stan (Stan Development Team, 2015) for htting the model, the NUTS 
sampler can produce a lot of divergent transitions even after the warm-up period (Piironen and 
Vehtari, 2015). Therefore we use v = 2> which is observed to behave numerically well and yield good 
results (see Piironen and Vehtari, 2015, for more details). 


B.2 Crime dataset revisited 

Let us now revisit the Crime dataset from Section 4.3, which contains 1992 observations and p = 102 
predictor variables. We split the data randomly into two so that n = 1000 points are used for model 
training and variable selection, and the remaining h = 992 points are used for testing. We apply the 
regression model (24) with the HS prior (35) with v = i for the weights in the full model. Note that 
we use these hierarchical priors only for the weights of the nonconstant inputs. For the intercept 
term we use a weakly informative prior 


u;o~N(0,52). 

For the global scale parameter r and for the noise variance we use the following uninformative 
priors 


r~C+(0,l), 

OC 1 , 

where C+(-) denotes the half-Cauchy distribution. The full model is fitted by drawing S = 4000 
samples from the posterior using Stan (4 chains, 2000 samples per each, first halts discarded as a 
warmup). 

After htting the full model, we performed the forward variable selection as in Section 4 using the 
projection predictive method. In other words, by starting from the empty model, at each step we add 
the variable that decreases the discrepancy (33) to the full model the most. The performance of the 
found models was then studied on the test set. For illustration, we also cross-validated the selection 
within the training data. In other words we repeated the model htting and selection A" = 10 times 
each time leaving njK points out for validation, and evaluated the performance of the full model and 
the found submodels using these left-out data. This was done to illustrate that the cross-validation 
gives a reliable estimate of the generalization performance for a given number of variables if the 
whole search process is cross-validated (as discussed in Sec. 4.4). 
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Figure 12: Crime data with HS prior for the full model: Difference in the mean log predictive density 
(MLPD) and mean squared error (MSE) between the projected submodel and the full 
model as a function of number of chosen variables up to 50 variables. Black is the average 
over the K = IQ cross-validated searches within n = 1000 data points, grey bars denote 
the 95% credible interval, and green is the test performance on the remaining h = 992 
test points when the search is done using all the n = 1000 training data points. 


Figure 12 shows the difference in the mean log predictive density and mean squared error between 
the projected submodel and the full model as a function of number of added variables up to 50 
variables. The black line is the average over the K = 10 cross-validation folds and the green line 
shows the result when the fitting and searching is performed using all the training data and the 
performance is evaluated on the test data. As expected, there is a good correspondence between 
the cross-validated and test performance. For this dataset, most of the predictive ability of the 
full model is captured with about 5 variables, and 20 variables are enough for getting predictions 
indistinguishable from the full model for all practical purposes. These results are essentially the same 
that were obtained using the BMA as the reference model (Sec. 4.3), that is, the spike-and-slab prior 
for the full model. 
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